[conflicted] Will prefer dplyr::select over any other package.
[conflicted] Will prefer dplyr::filter over any other package.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We learned about common univariate and multivariate distributions. For each of the distributions, there are well-defined and straightforward ways to sample values from the distribution. We can also manipulate these distributions to calculate probabilities.
The real world is complicated and we will quickly come across data where we struggle to find a common probability distributions.
Figure Figure 2.1 shows a relative frequency histogram for the duration of eruptions at Old Faithful in Yellowstone National Park.
[conflicted] Will prefer dplyr::select over any other package.
[conflicted] Will prefer dplyr::filter over any other package.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This distribution looks very complicated. But what if we break this distribution into pieces. In this case, what if we broke it into two normal distributions?
# show geyser as two normal distribution
library(mclust)
gmm_geyser <- Mclust(data = dplyr::select(geyser, duration), G = 2)
bind_cols(
geyser,
cluster = gmm_geyser$classification
) |>
ggplot(aes(duration, y = after_stat(density), fill = factor(cluster))) +
geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A latent variable is a variable that isn’t directly observed but can be inferred through other variables and modeling. Sometimes the latent variable is meaningful but unobserved. Sometimes it isn’t meaningful.
Latent variables are sometimes called hidden variables.
Breaking complex problems into smaller pieces is good. These latent variables will allow us to do some cools things:
In this set of notes, we’ll use latent variables to
Let’s consider a “data generation story” different than anything we considered in Chapter 5. Here, we sample in two stages (Hastie, Tibshirani, and Friedman 2009).
This new sampling procedure aligns closely with the idea of hierarchical sampling and hierarchical models. It is also sometimes called ancestral sampling [ancestral sampling: sample z_hat, sample x_hat|z_hat (Bishop 430)].
This two-step approach dramatically increases the types of distributions at our disposal because we are no longer limited to common univariate distributions like the normal distribution and uniform distribution. The two-step approach is also the foundation of two related tools:
With mixture distributions, we care about the overall distribution and don’t care about the latent variables.
With mixture modeling, we use the overall distribution to learn about the latent variables/subpopulations/clusters in the data.
A mixture distribution is a probabilistic model that is a linear combination of common probability distributions.
A discrete mixture distribution can be expressed as
\[ p_{mixture}(x) = \sum_{k = 1}^K \pi_kp(x) \]
where \(K\) is the number of mixtures and \(\pi_k\) is the weight of each PMF included in the mixture distribution.
Let’s consider a concrete example with a Bernoulli distribution and two normal distributions.
Now, let’s sample from a Bernoulli distribution and then sample from one of two normal distributions using R code.
generate_data <- function(n) {
step1 <- sample(x = c(0, 1), size = n, replace = TRUE, prob = c(0.75, 0.25))
step1 <- sort(step1)
step2 <- c(
rnorm(n = sum(step1 == 0), mean = 0, sd = 2),
rnorm(n = sum(step1 == 1), mean = 5, sd = 1)
)
tibble::tibble(
x = step1,
y = step2
)
}
set.seed(1)
generate_data(n = 1000) |>
ggplot(aes(x = y, y = after_stat(density))) +
geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This marginal distribution looks complex but the process of creating the marginal distribution is simple.
In fact, consider this quote from Bishop (2006) (Page 111):
By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy.
A component is each common probability distribution that is combined to create a mixture distribution. For example, a mixture of two Gaussian distributions has two components.
A mixing coefficient is the probability associated with a component with a component in a mixture distribution. Mixing coefficients must sum to 1.
We’ll use \(\pi_k\) for population mixing coefficients and \(p_k\) for sample mixing coefficients. Mixing coefficients are also called mixing weights and mixing probabilities.
Mixture distributions are often overparamaterized. For a univariate mixture of normals with \(k\) components, we have \(k\) means, \(k\) standard deviations, and \(k\) mixing coefficients.
Suppose we used statistical inference to infer some parameters for the geysers example above.
rsqrt(gmm_geyser\(parameters\)variance$sigmasq[1])and $s_2 =$r sqrt(gmm_geyser$parameters$variance$sigmasq[2])The mixture density is
\[ f_{mixture}(x) = p_1f(x|\mu = \bar{x_1}, \sigma = s_1) + p_2f(x|\mu = \bar{x_2},\sigma=s_2) \tag{2.1}\]
geyser_density <- function(x, model) {
probs <- model$parameters$pro
d1 <- dnorm(
x,
mean = model$parameters$mean[1],
sd = sqrt(model$parameters$variance$sigmasq[1])
)
d2 <- dnorm(
x,
mean = model$parameters$mean[2],
sd = sqrt(model$parameters$variance$sigmasq[2])
)
probs[1] * d1 + probs[2] * d2
}
mm <- tibble(
x = seq(0, 5, 0.01),
f_x = map_dbl(x, geyser_density, model = gmm_geyser)
)
ggplot() +
geom_histogram(data = geyser, mapping = aes(x = duration, y = after_stat(density))) +
geom_line(data = mm, mapping = aes(x, f_x), color = "red") +
labs(
title = "",
subtitles = "Observed data in black, inferred distribution in red"
)`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The multivariate normal distribution is a higher-dimensional version of the normal distribution. The MVN distribution has a vector of means of length \(k\) and a \(k\)-by-\(k\) variance-covariance matrix.
We show that a random vector is multivariate normally distributed with
\[ \vec{X} \sim \mathcal{N}(\vec\mu, \boldsymbol\Sigma) \tag{2.2}\]
The PDF of a multivariate normally distributed random variable is
\[ f(x) = (2\pi)^{-k/2}det(\boldsymbol\Sigma)^{-1/2}\exp\left(-\frac{1}{2}(\vec{x} - \vec\mu)^T\boldsymbol\Sigma^{-1}(\vec{x} - \vec\mu)\right) \tag{2.3}\]
K-Means Clustering is a heuristic-based approach to finding latent groups in data. The algorithm assigns each observation to one and only one group through a two step iteration that minimizes the Euclidean distance between observations and centroids for each group.
Consider the following data set.
Step 1: Randomly place K centroids in your n-dimensional vector space
Step 2: Calculate the nearest centroid for each point using a distance measure
Step 3: Assign each point to the nearest centroid
Step 4: Recalculate the position of the centroids based on the means of the assigned points
Step 5: Repeat steps 2-4 until no points change cluster assignments
Until now, we’ve assumed that we’ve known all parameters when working with mixture distributions. What if we want to learn these parameters/make inferences about these parameters?
The process of making inferences about latent groups is related to K-Means Clustering. While K-Means Clustering is heuristic based, mixture modeling formalize the process of making inferences about latent groups using probability models. Gaussian mixture models (GMM) are a popular mixture model.
Likelihood: The likelihood is how probable observed data are given a set of parameters. If \(\theta\) is a vector of parameters, then \(L(\theta |x) = f(x |\theta)\) is the likelihood function. The likelihood is not a probability. In fact, we often take the log of the likelihood for computational reasons.
We often don’t know the exact number of latent groups in the data. We need a way to compare models with varying numbers of groups. Simply picking the model with the maximum likelihood will lead to models with too many groups.
The Bayesian information criterion (BIC) is an alternative to likelihoods that penalizes models for having many parameters. Let \(L\) be the likelihood, \(m4\) the number of free parameters, and \(n\) the number of observations.
\[ BIC = -2log(L) + mlog(n) \tag{2.4}\]
We will choose models that minimize BIC. Ideally, we will use v-fold cross validation for this process.
EXAMPLE
Let’s consider a data generation story based on the Bernoulli distribution. Now, each variable, \(X_1, X_2, ..., X_D\), is draw from a mixture of \(K\) Bernoulli distributions.
\[ X_d = \begin{cases} Bern(p_1) \text{ with probability }\pi_1 \\ Bern(p_2) \text{ with probability }\pi_2 \\ \vdots \\ Bern(p_K) \text{ with probability }\pi_K \end{cases} \tag{2.5}\]
Let \(i\) be an index for each mixture that contributes to the random variable. The probability mass function of the random variable is written as
\[ P(X_d) = \Pi_{i = 1}^Kp_i^{x_i} (1 - p_i)^{1 - x_i} \tag{2.6}\]
Let’s consider a classic example from Bishop (2006) and Murphy (2022). The example uses the MNIST database, which contains 70,000 handwritten digits. The digits are stored in 784 variables, from a 28 by 28 grid, with values ranging from 0 to 255, which indicate the darkness of the pixel.
To prepare the data, we divide each pixel by 255 and then turn the pixels into indicators with values under 0.5 as 0 and values over 0.5 as 1. Figure Figure 2.2 visualizes the first four digits after reading in the data and applying pre-processing.
source(here::here("R", "visualize_digit.R"))
mnist <- read_csv(here::here("data", "mnist_binary.csv"))
glimpse(dplyr::select(mnist, 1:10))Rows: 60,000
Columns: 10
$ label <dbl> 5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4…
$ pix_28_1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_7 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_8 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pix_28_9 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
visualize_digit(mnist, 1)
visualize_digit(mnist, 2)
visualize_digit(mnist, 3)
visualize_digit(mnist, 4)The digits are labelled in the MNIST data set but we will ignore the labels and use Bernoulli Mixture Modeling to learn the latent labels or groups. We will treat each pixel as its own Bernoulli distribution and cluster observations using mixtures of 784 Bernoulli distributions. This means each cluster will contain \(784\) parameters.
Let’s start with a simple example using just the digits “1” and “8”. We’ll use library(flexmix) by Leisch (2004). library(flexmix) is powerful but uses different syntax than we are used to.
flexmix() expects a matrix.~.model = FLXMCmvbinary().The starting assignments are random, so we set a seed.
The MNIST data are already labelled, so we can compare our assignments to the labels if we convert the “soft assignments” to “hard assignments”. Note that most applications won’t have labels.
mnist |>
filter(label %in% c("1", "8")) |>
bind_cols(cluster = mnist_18_clust@cluster) |>
count(label, cluster)# A tibble: 4 × 3
label cluster n
<dbl> <int> <int>
1 1 1 482
2 1 2 6260
3 8 1 5610
4 8 2 241
Figure 2.3 shows the estimated \(p_i\) for each pixel for each cluster. The figure shows 784 \(p_i\) for \(k = 1\) and 784 \(p_i\) for \(k = 2\). We see that the estimated parameters closely resemble the digits.
Of course, each digit can differ from these images because everyone writes differently. In some ways, these are average digits acorss many version of the digits.
The BMM does a good job of labeling the digits and recovering the average shape of the digits.
Let’s now consider an example that uses all 10 digits.
In most applications, we won’t know the number of latent variables. First, we sample 1,0001 digits and run the model with \(k = 2, 3, ..., 12\). We’ll calculate the BIC for each hyperparameter and pick the \(k\) with lowest BIC.
\(k = 7\) provides the lowest BIC. This is probably because digits like 3 and 8 are very similar. Next, we run the BMM on the full data with \(k = 7\).
The MNIST data are already labelled, so we can compare our assignments to the labels if we convert the “soft assignments” to “hard assignments”. Note that most applications won’t have labels.
1 2 3 4 5 6 7
0 5 357 282 289 4875 1 114
1 36 288 40 35 0 6319 24
2 114 166 652 73 50 163 4740
3 263 473 4786 80 41 260 228
4 3384 1779 4 139 7 53 476
5 325 2315 2367 173 109 59 73
6 9 86 41 4365 42 128 1247
7 3560 2395 21 0 25 234 30
8 257 2582 2369 57 32 445 109
9 3739 1797 109 5 26 136 137
Figure 2.4 shows the estimated \(p_i\) for each pixel for each cluster. The following visualize the \(784K\) parameters that we estimated. It shows 784 \(p_i\) for \(k = 1, 2, ..., 7\) clusters. We see that the estimated parameters closely resemble the digits.
means <- rbind(
t(parameters(mnist_clust, component = 1)),
t(parameters(mnist_clust, component = 2)),
t(parameters(mnist_clust, component = 3)),
t(parameters(mnist_clust, component = 4)),
t(parameters(mnist_clust, component = 5)),
t(parameters(mnist_clust, component = 6)),
t(parameters(mnist_clust, component = 7))
) |>
as_tibble() |>
mutate(label = NA)
visualize_digit(means, 1)
visualize_digit(means, 2)
visualize_digit(means, 3)
visualize_digit(means, 4)
visualize_digit(means, 5)
visualize_digit(means, 6)
visualize_digit(means, 7)The example with all digits doesn’t result in 10 distinct mixtures but it does a fairly good job of structuring finding structure in the data. Without labels and considering the variety of messy handwriting, this is a useful model.
State human services typology
employed vs. unemployed vs. retired? yes/no on types of income
shopping cart
Rows: 2000 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (42): 7up, lasagna, pepsi, yop, red.wine, cheese, bbq, bulmers, mayonnai...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
challenges: 1. assumes a model 2. computation – overparameterized
This is solely to save computation time.↩︎